| Name | NetID | Email |
|---|---|---|
| Artsiom Strok | astrok2 | astrok2@illinois.edu |
| Lin Jiang | linj3 | linj3@illinois.edu |
| Mayank Chhablani | mchhab2 | mchhab2@illinois.edu |
Life expectancy has increased dramatically over the last few centuries. Since 1900, the global average life expectancy has more than doubled. The health sector has developed especially quickly in the past 15 years, improving human mortality rates, particularly in developing nations, compared with the preceding 30 years. Therefore, this project considers only data from 2000 to 2015 for exploration and analysis.
This dataset is a compilation of data from the Global Health Observatory (GHO) and United Nations. The GHO data repository is WHO’s gateway to health-related statistics which provides access to a variety of indicators on priority health topics including mortality and burden of diseases, environmental health, violence and injuries etc. (http://apps.who.int/gho/data/node.resources). The economic data such as GDP is collected from the National Accounts Main Aggregates Database under United Nations which collects and disseminates economic statistics from countries worldwide (https://unstats.un.org/unsd/snaama/Index).
The dataset was cleaned by removing some missing values, mainly for Population, Hepatitis B and GDP from lesser-known countries, and shared on the Kaggle website (https://www.kaggle.com/kumarajarshi/life-expectancy-who). The final dataset contains 2938 observations of 22 variables, selected as the more critical and representative ones among all categories of health-related factors, covering the years 2000 to 2015 for 193 countries.
The description of each variable for this dataset is listed below:
Numerical Response
- Life expectancy: Life expectancy in years
Numerical Predictors
- Year: Year
- Adult Mortality: Adult mortality rates of both sexes (probability of dying between 15 and 60 years per 1000 population)
- infant deaths: Number of infant deaths per 1000 population
- Alcohol: Alcohol, recorded per capita (15+) consumption (in litres of pure alcohol)
- percentage expenditure: Expenditure on health as a percentage of Gross Domestic Product per capita (%)
- Hepatitis B: Hepatitis B (HepB) immunization coverage among 1-year-olds (%)
- Measles: Measles, number of reported cases per 1000 population
- BMI: Average Body Mass Index of the entire population
- under-five deaths: Number of under-five deaths per 1000 population
- Polio: Polio (Pol3) immunization coverage among 1-year-olds (%)
- Total expenditure: General government expenditure on health as a percentage of total government expenditure (%)
- Diphtheria: Diphtheria tetanus toxoid and pertussis (DTP3) immunization coverage among 1-year-olds (%)
- HIV/AIDS: Deaths per 1000 live births from HIV/AIDS (0-4 years)
- GDP: Gross Domestic Product per capita (in USD)
- Population: Population of the country
- thinness 1-19 years: Prevalence of thinness among children and adolescents for Age 10 to 19 (%)
- thinness 5-9 years: Prevalence of thinness among children for Age 5 to 9 (%)
- Income composition of resources: Human Development Index in terms of income composition of resources (index ranging from 0 to 1)
- Schooling: Number of years of schooling (years)
Categorical Predictors
- Country: Country
- Status: Developed or Developing status
By combining the data from these two databases and exploring the interactions between variables such as immunization, mortality, education and economic factors, we expect to build a robust multiple linear regression model that predicts life expectancy more accurately.
For this project, we would like to:
library(knitr)
data = read.csv("Life Expectancy Data.csv")
kable(t(data[sample(nrow(data), 5), ]))
| | 1222 | 999 | 2694 | 1382 | 2561 |
|---|---|---|---|---|---|
| Country | Iran (Islamic Republic of) | Germany | Turkey | Kiribati | Tajikistan |
| Year | 2012 | 2011 | 2003 | 2012 | 2008 |
| Status | Developing | Developed | Developing | Developing | Developing |
| Life.expectancy | 75.1 | 85.0 | 71.6 | 65.7 | 66.4 |
| Adult.Mortality | 85 | 74 | 135 | 24 | 178 |
| infant.deaths | 21 | 2 | 35 | 0 | 10 |
| Alcohol | 0.01 | 11.20 | 1.42 | 0.01 | 0.32 |
| percentage.expenditure | 0.000 | 869.791 | 459.111 | 147.455 | 4.153 |
| Hepatitis.B | 98 | 88 | 68 | 94 | 86 |
| Measles | 332 | 1607 | 5844 | 0 | 0 |
| BMI | 56.0 | 6.4 | 55.9 | 76.2 | 35.3 |
| under.five.deaths | 24 | 3 | 43 | 0 | 12 |
| Polio | 99 | 94 | 69 | 92 | 87 |
| Total.expenditure | 6.98 | 1.93 | 5.34 | 1.37 | 5.58 |
| Diphtheria | 99 | 95 | 68 | 94 | 86 |
| HIV.AIDS | 0.1 | 0.1 | 0.1 | 0.1 | 0.3 |
| GDP | NA | 4681.33 | 4718.51 | 1763.82 | 76.91 |
| Population | NA | 8274983 | 668583 | 16613 | 739728 |
| thinness..1.19.years | 8.0 | 1.1 | 5.1 | 0.1 | 3.8 |
| thinness.5.9.years | 8.2 | 1.1 | 5.0 | 0.1 | 3.8 |
| Income.composition.of.resources | 0.755 | 0.912 | 0.668 | 0.581 | 0.592 |
| Schooling | 13.6 | 16.8 | 11.9 | 11.9 | 10.8 |
original_data = read.csv("Life Expectancy Data.csv")
kable(t(original_data[sample(nrow(original_data), 5), ]))
| | 1380 | 2304 | 1125 | 1029 | 2525 |
|---|---|---|---|---|---|
| Country | Kiribati | Sierra Leone | Haiti | Greece | Switzerland |
| Year | 2014 | 2009 | 2013 | 2013 | 2012 |
| Status | Developing | Developing | Developing | Developing | Developed |
| Life.expectancy | 66.1 | 47.1 | 62.7 | 86.0 | 82.7 |
| Adult.Mortality | 2 | 433 | 253 | 74 | 54 |
| infant.deaths | 0 | 28 | 14 | 0 | 0 |
| Alcohol | 0.01 | 3.97 | 5.68 | 7.46 | 9.86 |
| percentage.expenditure | 97.87 | 49.84 | 4.99 | 2183.11 | 18379.33 |
| Hepatitis.B | 75 | 84 | 68 | 98 | NA |
| Measles | 0 | 31 | 0 | 3 | 61 |
| BMI | 77.1 | 21.2 | 47.7 | 65.4 | 56.2 |
| under.five.deaths | 0 | 42 | 19 | 0 | 0 |
| Polio | 79 | 81 | 67 | 99 | 96 |
| Total.expenditure | 1.21 | 13.13 | 8.10 | 9.26 | 11.59 |
| Diphtheria | 75 | 84 | 68 | 99 | 96 |
| HIV.AIDS | 0.1 | 1.7 | 0.5 | 0.1 | 0.1 |
| GDP | 1684.54 | 394.59 | 81.27 | 21874.82 | 83164.39 |
| Population | 11458 | 63126 | 1431776 | 1965211 | 7996861 |
| thinness..1.19.years | 0.1 | 8.5 | 3.9 | 0.8 | 0.5 |
| thinness.5.9.years | 0.1 | 8.4 | 3.9 | 0.7 | 0.3 |
| Income.composition.of.resources | 0.597 | 0.375 | 0.483 | 0.860 | 0.932 |
| Schooling | 11.9 | 8.5 | 9.1 | 17.1 | 15.7 |
kable(sort(colSums(is.na(original_data)), decreasing = TRUE), col.names = "Number of missing values")
| | Number of missing values |
|---|---|
| Population | 652 |
| Hepatitis.B | 553 |
| GDP | 448 |
| Total.expenditure | 226 |
| Alcohol | 194 |
| Income.composition.of.resources | 167 |
| Schooling | 163 |
| BMI | 34 |
| thinness..1.19.years | 34 |
| thinness.5.9.years | 34 |
| Polio | 19 |
| Diphtheria | 19 |
| Life.expectancy | 10 |
| Adult.Mortality | 10 |
| Country | 0 |
| Year | 0 |
| Status | 0 |
| infant.deaths | 0 |
| percentage.expenditure | 0 |
| Measles | 0 |
| under.five.deaths | 0 |
| HIV.AIDS | 0 |
1289 samples have at least one missing value. Alcohol is missing in 194 samples, all of which belong to 2015, and these countries almost certainly have alcohol consumption above 0. The data was collected in 2015, when alcohol consumption figures for that year were simply not yet available.
Life expectancy and adult mortality are missing for 10 samples in 2013, all of which belong to island countries.
Hepatitis B is missing in 553 samples, which are spread across different countries and years.
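Statements such as "all missing Alcohol values fall in 2015" can be checked by tabulating the missing entries against a grouping column. The sketch below uses R's built-in airquality data as a stand-in, with Ozone playing the role of Alcohol and Month the role of Year; these stand-in names are assumptions for illustration and are not part of the life expectancy data.

```r
# Stand-in data: the built-in airquality data frame (not the WHO dataset).
df <- airquality

# Missing values per column, largest first.
na_per_column <- sort(colSums(is.na(df)), decreasing = TRUE)

# How the missing Ozone values are distributed across months.
na_by_month <- table(df$Month[is.na(df$Ozone)])

# Number of rows with at least one missing value.
rows_with_na <- sum(!complete.cases(df))

print(na_per_column)
print(na_by_month)
print(rows_with_na)
```

For the report's data, the analogous call would be table(original_data$Year[is.na(original_data$Alcohol)]).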
nrow(original_data)
## [1] 2938
data = na.omit(original_data)
nrow(original_data) - nrow(data)
## [1] 1289
summary(data)
## Country Year Status Life.expectancy
## Afghanistan: 16 Min. :2000 Developed : 242 Min. :44.0
## Albania : 16 1st Qu.:2005 Developing:1407 1st Qu.:64.4
## Armenia : 15 Median :2008 Median :71.7
## Austria : 15 Mean :2008 Mean :69.3
## Belarus : 15 3rd Qu.:2011 3rd Qu.:75.0
## Belgium : 15 Max. :2015 Max. :89.0
## (Other) :1557
## Adult.Mortality infant.deaths Alcohol percentage.expenditure
## Min. : 1 Min. : 0.0 Min. : 0.01 Min. : 0
## 1st Qu.: 77 1st Qu.: 1.0 1st Qu.: 0.81 1st Qu.: 37
## Median :148 Median : 3.0 Median : 3.79 Median : 145
## Mean :168 Mean : 32.6 Mean : 4.53 Mean : 699
## 3rd Qu.:227 3rd Qu.: 22.0 3rd Qu.: 7.34 3rd Qu.: 509
## Max. :723 Max. :1600.0 Max. :17.87 Max. :18961
##
## Hepatitis.B Measles BMI under.five.deaths
## Min. : 2.0 Min. : 0 Min. : 2.0 Min. : 0.0
## 1st Qu.:74.0 1st Qu.: 0 1st Qu.:19.5 1st Qu.: 1.0
## Median :89.0 Median : 15 Median :43.7 Median : 4.0
## Mean :79.2 Mean : 2224 Mean :38.1 Mean : 44.2
## 3rd Qu.:96.0 3rd Qu.: 373 3rd Qu.:55.8 3rd Qu.: 29.0
## Max. :99.0 Max. :131441 Max. :77.1 Max. :2100.0
##
## Polio Total.expenditure Diphtheria HIV.AIDS
## Min. : 3.0 Min. : 0.74 Min. : 2.0 Min. : 0.10
## 1st Qu.:81.0 1st Qu.: 4.41 1st Qu.:82.0 1st Qu.: 0.10
## Median :93.0 Median : 5.84 Median :92.0 Median : 0.10
## Mean :83.6 Mean : 5.96 Mean :84.2 Mean : 1.98
## 3rd Qu.:97.0 3rd Qu.: 7.47 3rd Qu.:97.0 3rd Qu.: 0.70
## Max. :99.0 Max. :14.39 Max. :99.0 Max. :50.60
##
## GDP Population thinness..1.19.years thinness.5.9.years
## Min. : 2 Min. :3.40e+01 Min. : 0.10 Min. : 0.10
## 1st Qu.: 462 1st Qu.:1.92e+05 1st Qu.: 1.60 1st Qu.: 1.70
## Median : 1593 Median :1.42e+06 Median : 3.00 Median : 3.20
## Mean : 5566 Mean :1.47e+07 Mean : 4.85 Mean : 4.91
## 3rd Qu.: 4719 3rd Qu.:7.66e+06 3rd Qu.: 7.10 3rd Qu.: 7.10
## Max. :119173 Max. :1.29e+09 Max. :27.20 Max. :28.20
##
## Income.composition.of.resources Schooling
## Min. :0.000 Min. : 4.2
## 1st Qu.:0.509 1st Qu.:10.3
## Median :0.673 Median :12.3
## Mean :0.632 Mean :12.1
## 3rd Qu.:0.751 3rd Qu.:14.0
## Max. :0.936 Max. :20.7
##
Looking at the summary, we can already see some inconsistencies. The maximum of infant.deaths is 1600, which does not make sense for a rate per 1000 population. Similarly implausible maxima appear for Measles (131441) and under.five.deaths (2100).
boxplot(data$Adult.Mortality)
#kable(t(head(data[data$Adult.Mortality > boxplot(data$Adult.Mortality)$stats[5], ])))
boxplot(data$infant.deaths)
#kable(t(head(data[data$infant.deaths > boxplot(data$infant.deaths)$stats[5], ])))
boxplot(data$Measles)
#kable(t(head(data[data$Measles > boxplot(data$Measles)$stats[5], ])))
boxplot(data$under.five.deaths)
#kable(t(head(data[data$under.five.deaths > boxplot(data$under.five.deaths)$stats[5], ])))
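As a complement to the boxplots, the number of impossible values can be counted directly, since a rate per 1000 population cannot exceed 1000. This is a minimal sketch on a toy data frame: only the maxima 1600 and 2100 are taken from the summary above, the remaining values and the names toy and per_1000_limit are illustrative assumptions.

```r
# Illustrative toy values; only the maxima 1600 and 2100 come from the summary.
toy <- data.frame(
  infant.deaths = c(0, 3, 62, 1600),
  under.five.deaths = c(0, 4, 83, 2100)
)

# A rate per 1000 population cannot exceed 1000.
per_1000_limit <- 1000

# Count of implausible values in each column.
n_implausible <- sapply(toy, function(x) sum(x > per_1000_limit))
print(n_implausible)

# Rows where any of the rate columns exceeds the limit.
flagged <- toy[rowSums(toy > per_1000_limit) > 0, ]
print(flagged)
```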
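library(ggplot2)
library(tidyr)
# Histogram of every numeric column (Country and Status are dropped)
ggplot(gather(data[-c(1, 3)]), aes(value)) +
  geom_histogram(bins = 10) +
  facet_wrap(~key, scales = 'free_x')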
ggplot(data, aes(x = Status, y = Life.expectancy)) + geom_boxplot()
# Adult.Mortality is continuous, so a scatter plot is used instead of a boxplot
ggplot(data, aes(x = Adult.Mortality, y = Life.expectancy)) + geom_point()
The original dataset contains the Country variable, which we will not use for model building. This column is therefore excluded from the following analysis, along with the records that have NA in any of the columns mentioned above. This reduces the dataset from 2938 to 1649 rows (about 56% of the original), which still retains a little over half of the observations and keeps model fitting fast.
The summary and structure of the cleaned dataset are listed below.
summary(data)
## Country Year Status Life.expectancy
## Afghanistan: 16 Min. :2000 Developed : 242 Min. :44.0
## Albania : 16 1st Qu.:2005 Developing:1407 1st Qu.:64.4
## Armenia : 15 Median :2008 Median :71.7
## Austria : 15 Mean :2008 Mean :69.3
## Belarus : 15 3rd Qu.:2011 3rd Qu.:75.0
## Belgium : 15 Max. :2015 Max. :89.0
## (Other) :1557
## Adult.Mortality infant.deaths Alcohol percentage.expenditure
## Min. : 1 Min. : 0.0 Min. : 0.01 Min. : 0
## 1st Qu.: 77 1st Qu.: 1.0 1st Qu.: 0.81 1st Qu.: 37
## Median :148 Median : 3.0 Median : 3.79 Median : 145
## Mean :168 Mean : 32.6 Mean : 4.53 Mean : 699
## 3rd Qu.:227 3rd Qu.: 22.0 3rd Qu.: 7.34 3rd Qu.: 509
## Max. :723 Max. :1600.0 Max. :17.87 Max. :18961
##
## Hepatitis.B Measles BMI under.five.deaths
## Min. : 2.0 Min. : 0 Min. : 2.0 Min. : 0.0
## 1st Qu.:74.0 1st Qu.: 0 1st Qu.:19.5 1st Qu.: 1.0
## Median :89.0 Median : 15 Median :43.7 Median : 4.0
## Mean :79.2 Mean : 2224 Mean :38.1 Mean : 44.2
## 3rd Qu.:96.0 3rd Qu.: 373 3rd Qu.:55.8 3rd Qu.: 29.0
## Max. :99.0 Max. :131441 Max. :77.1 Max. :2100.0
##
## Polio Total.expenditure Diphtheria HIV.AIDS
## Min. : 3.0 Min. : 0.74 Min. : 2.0 Min. : 0.10
## 1st Qu.:81.0 1st Qu.: 4.41 1st Qu.:82.0 1st Qu.: 0.10
## Median :93.0 Median : 5.84 Median :92.0 Median : 0.10
## Mean :83.6 Mean : 5.96 Mean :84.2 Mean : 1.98
## 3rd Qu.:97.0 3rd Qu.: 7.47 3rd Qu.:97.0 3rd Qu.: 0.70
## Max. :99.0 Max. :14.39 Max. :99.0 Max. :50.60
##
## GDP Population thinness..1.19.years thinness.5.9.years
## Min. : 2 Min. :3.40e+01 Min. : 0.10 Min. : 0.10
## 1st Qu.: 462 1st Qu.:1.92e+05 1st Qu.: 1.60 1st Qu.: 1.70
## Median : 1593 Median :1.42e+06 Median : 3.00 Median : 3.20
## Mean : 5566 Mean :1.47e+07 Mean : 4.85 Mean : 4.91
## 3rd Qu.: 4719 3rd Qu.:7.66e+06 3rd Qu.: 7.10 3rd Qu.: 7.10
## Max. :119173 Max. :1.29e+09 Max. :27.20 Max. :28.20
##
## Income.composition.of.resources Schooling
## Min. :0.000 Min. : 4.2
## 1st Qu.:0.509 1st Qu.:10.3
## Median :0.673 Median :12.3
## Mean :0.632 Mean :12.1
## 3rd Qu.:0.751 3rd Qu.:14.0
## Max. :0.936 Max. :20.7
##
data = na.omit(data)
data = data[-1] #exclude country from dataset
str(data)
## 'data.frame': 1649 obs. of 21 variables:
## $ Year : int 2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 ...
## $ Status : Factor w/ 2 levels "Developed","Developing": 2 2 2 2 2 2 2 2 2 2 ...
## $ Life.expectancy : num 65 59.9 59.9 59.5 59.2 58.8 58.6 58.1 57.5 57.3 ...
## $ Adult.Mortality : int 263 271 268 272 275 279 281 287 295 295 ...
## $ infant.deaths : int 62 64 66 69 71 74 77 80 82 84 ...
## $ Alcohol : num 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.03 0.02 0.03 ...
## $ percentage.expenditure : num 71.3 73.5 73.2 78.2 7.1 ...
## $ Hepatitis.B : int 65 62 64 67 68 66 63 64 63 64 ...
## $ Measles : int 1154 492 430 2787 3013 1989 2861 1599 1141 1990 ...
## $ BMI : num 19.1 18.6 18.1 17.6 17.2 16.7 16.2 15.7 15.2 14.7 ...
## $ under.five.deaths : int 83 86 89 93 97 102 106 110 113 116 ...
## $ Polio : int 6 58 62 67 68 66 63 64 63 58 ...
## $ Total.expenditure : num 8.16 8.18 8.13 8.52 7.87 9.2 9.42 8.33 6.73 7.43 ...
## $ Diphtheria : int 65 62 64 67 68 66 63 64 63 58 ...
## $ HIV.AIDS : num 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 ...
## $ GDP : num 584.3 612.7 631.7 670 63.5 ...
## $ Population : num 33736494 327582 31731688 3696958 2978599 ...
## $ thinness..1.19.years : num 17.2 17.5 17.7 17.9 18.2 18.4 18.6 18.8 19 19.2 ...
## $ thinness.5.9.years : num 17.3 17.5 17.7 18 18.2 18.4 18.7 18.9 19.1 19.3 ...
## $ Income.composition.of.resources: num 0.479 0.476 0.47 0.463 0.454 0.448 0.434 0.433 0.415 0.405 ...
## $ Schooling : num 10.1 10 9.9 9.8 9.5 9.2 8.9 8.7 8.4 8.1 ...
Let’s take a look at the dataset using plots of Life Expectancy against Status and Year. The boxplot indicates a clear difference in Life Expectancy between Developed and Developing countries. As expected, Life Expectancy tends to increase over the years, and the violin plot shows that the data is well distributed across the different years.
par(mfrow = c(1, 2))
# Histogram of Life Expectancy
hist(
  data$Life.expectancy,
  xlab = "Life Expectancy",
  main = "Distribution of Life Expectancy",
  col = "dodgerblue",
  breaks = 25
)
# Boxplot of Life Expectancy vs Country Status (Developed vs Developing)
plot(
  data$Status,
  data$Life.expectancy,
  xlab = "Status",
  ylab = "Life Expectancy",
  main = "Life Expectancy vs. Status",
  col = c(2, 3)
)
# Violin plot of Life Expectancy vs. Year
data %>% ggplot() + geom_violin(aes(
  x = Year,
  y = Life.expectancy,
  group = Year,
  fill = Year
)) + labs(title = "Life Expectancy vs. Year")
Collinearity is visualized using the following plots. The results show that some pairs of predictors are strongly collinear, such as infant.deaths vs. under.five.deaths (correlation 1.00), GDP vs. percentage.expenditure (0.96) and thinness..1.19.years vs. thinness.5.9.years (0.93).
corrplot(cor(data[-c(2)]))
ggpairs(data[-c(2)])
round(cor(data[-c(2)]),2)
## Year Life.expectancy Adult.Mortality
## Year 1.00 0.05 -0.04
## Life.expectancy 0.05 1.00 -0.70
## Adult.Mortality -0.04 -0.70 1.00
## infant.deaths 0.01 -0.17 0.04
## Alcohol -0.11 0.40 -0.18
## percentage.expenditure 0.07 0.41 -0.24
## Hepatitis.B 0.11 0.20 -0.11
## Measles -0.05 -0.07 0.00
## BMI 0.01 0.54 -0.35
## under.five.deaths 0.01 -0.19 0.06
## Polio -0.02 0.33 -0.20
## Total.expenditure 0.06 0.17 -0.09
## Diphtheria 0.03 0.34 -0.19
## HIV.AIDS -0.12 -0.59 0.55
## GDP 0.10 0.44 -0.26
## Population 0.01 -0.02 -0.02
## thinness..1.19.years 0.02 -0.46 0.27
## thinness.5.9.years 0.01 -0.46 0.29
## Income.composition.of.resources 0.12 0.72 -0.44
## Schooling 0.09 0.73 -0.42
## infant.deaths Alcohol percentage.expenditure
## Year 0.01 -0.11 0.07
## Life.expectancy -0.17 0.40 0.41
## Adult.Mortality 0.04 -0.18 -0.24
## infant.deaths 1.00 -0.11 -0.09
## Alcohol -0.11 1.00 0.42
## percentage.expenditure -0.09 0.42 1.00
## Hepatitis.B -0.23 0.11 0.02
## Measles 0.53 -0.05 -0.06
## BMI -0.23 0.35 0.24
## under.five.deaths 1.00 -0.10 -0.09
## Polio -0.16 0.24 0.13
## Total.expenditure -0.15 0.21 0.18
## Diphtheria -0.16 0.24 0.13
## HIV.AIDS 0.01 -0.03 -0.10
## GDP -0.10 0.44 0.96
## Population 0.67 -0.03 -0.02
## thinness..1.19.years 0.46 -0.40 -0.26
## thinness.5.9.years 0.46 -0.39 -0.26
## Income.composition.of.resources -0.13 0.56 0.40
## Schooling -0.21 0.62 0.42
## Hepatitis.B Measles BMI under.five.deaths
## Year 0.11 -0.05 0.01 0.01
## Life.expectancy 0.20 -0.07 0.54 -0.19
## Adult.Mortality -0.11 0.00 -0.35 0.06
## infant.deaths -0.23 0.53 -0.23 1.00
## Alcohol 0.11 -0.05 0.35 -0.10
## percentage.expenditure 0.02 -0.06 0.24 -0.09
## Hepatitis.B 1.00 -0.12 0.14 -0.24
## Measles -0.12 1.00 -0.15 0.52
## BMI 0.14 -0.15 1.00 -0.24
## under.five.deaths -0.24 0.52 -0.24 1.00
## Polio 0.46 -0.06 0.19 -0.17
## Total.expenditure 0.11 -0.11 0.19 -0.15
## Diphtheria 0.59 -0.06 0.18 -0.18
## HIV.AIDS -0.09 0.00 -0.21 0.02
## GDP 0.04 -0.06 0.27 -0.10
## Population -0.13 0.32 -0.08 0.66
## thinness..1.19.years -0.13 0.18 -0.55 0.46
## thinness.5.9.years -0.13 0.17 -0.55 0.46
## Income.composition.of.resources 0.18 -0.06 0.51 -0.15
## Schooling 0.22 -0.12 0.55 -0.23
## Polio Total.expenditure Diphtheria HIV.AIDS
## Year -0.02 0.06 0.03 -0.12
## Life.expectancy 0.33 0.17 0.34 -0.59
## Adult.Mortality -0.20 -0.09 -0.19 0.55
## infant.deaths -0.16 -0.15 -0.16 0.01
## Alcohol 0.24 0.21 0.24 -0.03
## percentage.expenditure 0.13 0.18 0.13 -0.10
## Hepatitis.B 0.46 0.11 0.59 -0.09
## Measles -0.06 -0.11 -0.06 0.00
## BMI 0.19 0.19 0.18 -0.21
## under.five.deaths -0.17 -0.15 -0.18 0.02
## Polio 1.00 0.12 0.61 -0.11
## Total.expenditure 0.12 1.00 0.13 0.04
## Diphtheria 0.61 0.13 1.00 -0.12
## HIV.AIDS -0.11 0.04 -0.12 1.00
## GDP 0.16 0.18 0.16 -0.11
## Population -0.05 -0.08 -0.04 -0.03
## thinness..1.19.years -0.16 -0.21 -0.19 0.17
## thinness.5.9.years -0.17 -0.22 -0.18 0.18
## Income.composition.of.resources 0.31 0.18 0.34 -0.25
## Schooling 0.35 0.24 0.35 -0.21
## GDP Population thinness..1.19.years
## Year 0.10 0.01 0.02
## Life.expectancy 0.44 -0.02 -0.46
## Adult.Mortality -0.26 -0.02 0.27
## infant.deaths -0.10 0.67 0.46
## Alcohol 0.44 -0.03 -0.40
## percentage.expenditure 0.96 -0.02 -0.26
## Hepatitis.B 0.04 -0.13 -0.13
## Measles -0.06 0.32 0.18
## BMI 0.27 -0.08 -0.55
## under.five.deaths -0.10 0.66 0.46
## Polio 0.16 -0.05 -0.16
## Total.expenditure 0.18 -0.08 -0.21
## Diphtheria 0.16 -0.04 -0.19
## HIV.AIDS -0.11 -0.03 0.17
## GDP 1.00 -0.02 -0.28
## Population -0.02 1.00 0.28
## thinness..1.19.years -0.28 0.28 1.00
## thinness.5.9.years -0.28 0.28 0.93
## Income.composition.of.resources 0.45 -0.01 -0.45
## Schooling 0.47 -0.04 -0.49
## thinness.5.9.years
## Year 0.01
## Life.expectancy -0.46
## Adult.Mortality 0.29
## infant.deaths 0.46
## Alcohol -0.39
## percentage.expenditure -0.26
## Hepatitis.B -0.13
## Measles 0.17
## BMI -0.55
## under.five.deaths 0.46
## Polio -0.17
## Total.expenditure -0.22
## Diphtheria -0.18
## HIV.AIDS 0.18
## GDP -0.28
## Population 0.28
## thinness..1.19.years 0.93
## thinness.5.9.years 1.00
## Income.composition.of.resources -0.44
## Schooling -0.47
## Income.composition.of.resources Schooling
## Year 0.12 0.09
## Life.expectancy 0.72 0.73
## Adult.Mortality -0.44 -0.42
## infant.deaths -0.13 -0.21
## Alcohol 0.56 0.62
## percentage.expenditure 0.40 0.42
## Hepatitis.B 0.18 0.22
## Measles -0.06 -0.12
## BMI 0.51 0.55
## under.five.deaths -0.15 -0.23
## Polio 0.31 0.35
## Total.expenditure 0.18 0.24
## Diphtheria 0.34 0.35
## HIV.AIDS -0.25 -0.21
## GDP 0.45 0.47
## Population -0.01 -0.04
## thinness..1.19.years -0.45 -0.49
## thinness.5.9.years -0.44 -0.47
## Income.composition.of.resources 1.00 0.78
## Schooling 0.78 1.00
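Reading strongly correlated pairs off a matrix of this size is error-prone, so it can help to list them programmatically. The sketch below does this on the built-in mtcars data as a stand-in; the threshold of 0.85 and the names cor_mat and high_pairs are assumptions chosen for illustration.

```r
# Stand-in data: the built-in mtcars data frame (all columns numeric).
cor_mat <- cor(mtcars)

# Look only at the upper triangle so that each pair is reported once.
threshold <- 0.85
idx <- which(upper.tri(cor_mat) & abs(cor_mat) > threshold, arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r = round(cor_mat[idx], 2)
)

# Strongest correlations first.
high_pairs <- high_pairs[order(-abs(high_pairs$r)), ]
print(high_pairs)
```

For the report's data, cor_mat would be cor(data[-c(2)]), the same matrix printed above.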
# train test split 70/30 hold out
train_size = floor(0.7 * nrow(data))
train_idx = sample(nrow(data), train_size)
data.train = data[train_idx, ]
data.test = data[-train_idx, ]
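Because sample() draws at random, the split above changes on every run unless a seed is fixed first. The following sketch shows a reproducible 70/30 hold-out on the built-in mtcars data; the seed value 42 and the variable names are arbitrary assumptions, not settings taken from the report.

```r
# Stand-in data: the built-in mtcars data frame.
df <- mtcars

split_70_30 <- function(df, seed) {
  set.seed(seed)  # fixing the seed makes the random draw repeatable
  train_size <- floor(0.7 * nrow(df))
  train_idx <- sample(nrow(df), train_size)
  list(train = df[train_idx, ], test = df[-train_idx, ])
}

first <- split_70_30(df, seed = 42)
second <- split_70_30(df, seed = 42)

# The two splits are identical because the same seed was used.
print(identical(first$train, second$train))
print(c(train = nrow(first$train), test = nrow(first$test)))
```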
Steps to generate the best model: we start with a simple additive model and then use AIC/BIC to weed out unnecessary predictors. We then try interactions to see whether they improve the model, and later we may use response or predictor transformations to improve the selected model.
Additive Model with all predictors
model.additive = lm(Life.expectancy ~ ., data.train)
#summary(model.additive)
par(mfrow = c(2, 2))
plot(model.additive)
Looking at the Residuals vs Fitted plot, we may have issues with linearity and equal variance. Similarly, the Normal Q-Q plot suggests that we need to check the normality assumption.
Let’s check the homoscedasticity and normality assumptions.
show_metrics(model.additive)
| bptest | shapiro_test | adj_r2 | LOOCV | TRAIN_RMSE | TEST_RMSE |
|---|---|---|---|---|---|
| 3.19638743734941e-21 | 0.000059802581149761 | 0.844706587093747 | 3.50866876426059 | 11.8303259589542 | 14.4884239442948 |
From the data above, although the Adjusted R^2 and RMSE look good, the bptest and shapiro_test values are far below 0.05, so we REJECT both the homoscedasticity and normality assumptions. Also, the model summary suggests that some predictors need to be pruned.
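The definition of show_metrics is not shown in this report. The following is a hypothetical sketch of a helper that produces comparable columns in base R, assuming that bptest and shapiro_test are p-values; the function name metrics_sketch, its arguments, and the mtcars stand-in data are all assumptions. The Breusch-Pagan part uses the studentized form (n times the R^2 of regressing squared residuals on the predictors); lmtest::bptest offers a packaged version.

```r
# Hypothetical metrics helper; NOT the report's own show_metrics().
metrics_sketch <- function(model, test_data, response) {
  res <- resid(model)
  n <- length(res)

  # Studentized Breusch-Pagan statistic: n * R^2 from regressing the squared
  # residuals on the model's design matrix.
  X <- model.matrix(model)
  aux <- lm.fit(X, res^2)
  r2_aux <- 1 - sum(aux$residuals^2) / sum((res^2 - mean(res^2))^2)
  bp_p <- pchisq(n * r2_aux, df = aux$rank - 1, lower.tail = FALSE)

  # Leave-one-out CV RMSE via the hat-value shortcut for linear models.
  loocv <- sqrt(mean((res / (1 - hatvalues(model)))^2))

  pred <- predict(model, newdata = test_data)
  data.frame(
    bptest = bp_p,
    shapiro_test = shapiro.test(res)$p.value,
    adj_r2 = summary(model)$adj.r.squared,
    LOOCV = loocv,
    TRAIN_RMSE = sqrt(mean(res^2)),
    TEST_RMSE = sqrt(mean((test_data[[response]] - pred)^2))
  )
}

# Usage on stand-in data (built-in mtcars, arbitrary seed and split).
set.seed(42)
idx <- sample(nrow(mtcars), 22)
fit <- lm(mpg ~ wt + hp, data = mtcars[idx, ])
m <- metrics_sketch(fit, mtcars[-idx, ], "mpg")
print(m)
```

A small p-value in either of the first two columns is evidence against the corresponding assumption, which is how the tables in this section are read.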
Let’s apply AIC and BIC to see if we can improve our model
model.additive.step.aic = step(model.additive, trace = FALSE)
#summary(model.additive.step.aic)
par(mfrow = c(2, 2))
plot(model.additive.step.aic)
Looking at the AIC model, we can see that the GDP vs. percentage.expenditure collinearity issue has been removed by AIC. Some of the non-significant predictors have also been removed.
show_metrics(model.additive.step.aic)
| bptest | shapiro_test | adj_r2 | LOOCV | TRAIN_RMSE | TEST_RMSE |
|---|---|---|---|---|---|
| 2.61540494570909e-22 | 0.0000318962005371406 | 0.84483096513633 | 3.50080398499081 | 11.8834501668962 | 14.4433330629638 |
Apart from reduction in predictors, there isn’t a significant improvement. Let’s try out BIC:
model.additive.step.bic = step(model.additive, k = log(nrow(data.train)), trace = FALSE)
#summary(model.additive.step.bic)
par(mfrow = c(2, 2))
plot(model.additive.step.bic)
show_metrics(model.additive.step.bic)
| bptest | shapiro_test | adj_r2 | LOOCV | TRAIN_RMSE | TEST_RMSE |
|---|---|---|---|---|---|
| 2.12322671490772e-20 | 0.0000133651092912547 | 0.842918941917074 | 3.51818636644947 | 12.0721275312972 | 14.5125336442044 |
So the BIC and AIC results are almost the same, and so are the evaluation metrics, but we saw two improvements: (1) a reduction in the number of predictors and (2) a reduction in influential points as per the Residuals vs Leverage graph.
BIC is a little aggressive in removing predictors, so we select the AIC additive model and try interactions.
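The only difference between the two selection runs is the penalty per parameter passed to step(): the default k = 2 corresponds to AIC, while k = log(n) corresponds to BIC, and log(n) exceeds 2 once n is at least 8, which is why BIC tends to drop more predictors. A minimal sketch on the built-in mtcars data (a stand-in, not the report's data):

```r
# Stand-in data: the built-in mtcars data frame.
full <- lm(mpg ~ ., data = mtcars)
n <- nrow(mtcars)

# Backward stepwise selection with two different penalties.
fit_aic <- step(full, trace = 0)              # k = 2 (default): AIC
fit_bic <- step(full, k = log(n), trace = 0)  # k = log(n): BIC

terms_aic <- attr(terms(fit_aic), "term.labels")
terms_bic <- attr(terms(fit_bic), "term.labels")

print(c(penalty_aic = 2, penalty_bic = log(n)))
print(terms_aic)
print(terms_bic)
```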
Also, let’s check the variance inflation factor (VIF).
vif_aic_all_model = vif(model.additive.step.aic)
vif_aic_all_model[vif_aic_all_model > 5]
## infant.deaths under.five.deaths
## 225.2 220.1
As the VIF computation above shows, infant.deaths and under.five.deaths are causing a collinearity issue in the model. We found the same issue in the correlation plot. Let’s try to improve the model and address the issues that violate the model assumptions.
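For reference, the VIF of a predictor is 1 / (1 - R^2), where R^2 comes from regressing that predictor on all the other predictors, so a value above 5 means more than 80% of its variance is explained by the rest. The sketch below computes this by hand on the built-in mtcars data; the chosen predictors are an arbitrary assumption for illustration.

```r
# Stand-in predictors from the built-in mtcars data frame.
predictors <- c("wt", "hp", "disp")

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing predictor j on the others.
vif_manual <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  aux <- lm(reformulate(others, response = p), data = mtcars)
  1 / (1 - summary(aux)$r.squared)
})
print(round(vif_manual, 2))

# Cross-check: the VIFs equal the diagonal of the inverse correlation matrix.
vif_check <- diag(solve(cor(mtcars[predictors])))
print(round(vif_check, 2))
```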
Applying Two-Way Interaction:
model.interaction = lm(Life.expectancy ~ . ^ 2, data = data.train)
Since this will greatly increase the number of predictors, let’s generate AIC and BIC models for this interaction model.
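To see how quickly the formula . ^ 2 grows: with p single-column predictors it produces p main effects plus choose(p, 2) two-way interactions. With the 20 predictors of the cleaned dataset that is 20 + 190 = 210 terms before any selection. The sketch below confirms the count on the built-in mtcars data, used here as a stand-in with 10 numeric predictors.

```r
# Stand-in data: mtcars has 10 numeric predictors besides the response mpg.
p <- ncol(mtcars) - 1

# Main effects plus all two-way interactions.
expected_terms <- p + choose(p, 2)

# Count the columns of the actual design matrix (minus the intercept).
X <- model.matrix(mpg ~ .^2, data = mtcars)
actual_terms <- ncol(X) - 1

print(c(expected = expected_terms, actual = actual_terms))

# The same arithmetic for 20 predictors.
print(20 + choose(20, 2))
```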
AIC Interaction Model:
model.interaction.step.aic = step(model.interaction, trace = 0)
#summary(model.interaction.step.aic)
par(mfrow = c(2, 2))
plot(model.interaction.step.aic)
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
show_metrics(model.interaction.step.aic)
| bptest | shapiro_test | adj_r2 | LOOCV | TRAIN_RMSE | TEST_RMSE |
|---|---|---|---|---|---|
| 0.00598315752593769 | 5.02438486442735e-10 | 0.918642514611818 | 2.89436801651001 | 5.59611868899128 | 15.3358935315704 |
The AIC and BIC models are again somewhat similar. The AIC model uses 119 predictors in total, and with interactions we have improved the adjusted R^2 and the bptest value. However, TRAIN_RMSE dropped to 5.60 while TEST_RMSE rose to 15.34, which suggests overfitting. Let’s see the VIF impact.
vif_interaction_model = vif(model.interaction.step.aic)
vif_interaction_model[vif_interaction_model > 5]
## Year
## 9.463e+01
## StatusDeveloping
## 1.206e+03
## Adult.Mortality
## 7.079e+05
## infant.deaths
## 3.197e+08
## Alcohol
## 1.245e+02
## percentage.expenditure
## 8.236e+06
## Hepatitis.B
## 7.088e+01
## Measles
## 8.146e+02
## BMI
## 1.094e+02
## under.five.deaths
## 3.528e+08
## Polio
## 6.172e+01
## Total.expenditure
## 4.558e+05
## Diphtheria
## 1.709e+02
## HIV.AIDS
## 1.108e+06
## GDP
## 8.047e+06
## Population
## 3.971e+06
## thinness..1.19.years
## 9.647e+06
## thinness.5.9.years
## 9.988e+06
## Income.composition.of.resources
## 1.820e+06
## Schooling
## 1.380e+06
## Year:Adult.Mortality
## 7.089e+05
## Year:infant.deaths
## 3.238e+08
## Year:percentage.expenditure
## 8.243e+06
## Year:under.five.deaths
## 3.577e+08
## Year:Total.expenditure
## 4.577e+05
## Year:HIV.AIDS
## 1.107e+06
## Year:GDP
## 8.075e+06
## Year:Population
## 4.022e+06
## Year:thinness..1.19.years
## 9.691e+06
## Year:thinness.5.9.years
## 1.003e+07
## Year:Income.composition.of.resources
## 1.825e+06
## Year:Schooling
## 1.389e+06
## StatusDeveloping:Adult.Mortality
## 8.924e+01
## StatusDeveloping:infant.deaths
## 1.800e+06
## StatusDeveloping:Hepatitis.B
## 6.603e+01
## StatusDeveloping:under.five.deaths
## 2.288e+06
## StatusDeveloping:Total.expenditure
## 2.777e+01
## StatusDeveloping:Diphtheria
## 4.078e+02
## StatusDeveloping:thinness..1.19.years
## 3.142e+04
## StatusDeveloping:thinness.5.9.years
## 2.619e+04
## StatusDeveloping:Income.composition.of.resources
## 1.092e+03
## StatusDeveloping:Schooling
## 2.695e+02
## Adult.Mortality:percentage.expenditure
## 4.291e+01
## Adult.Mortality:BMI
## 1.228e+01
## Adult.Mortality:under.five.deaths
## 4.157e+01
## Adult.Mortality:HIV.AIDS
## 1.225e+01
## Adult.Mortality:GDP
## 4.236e+01
## Adult.Mortality:Population
## 2.845e+01
## Adult.Mortality:Income.composition.of.resources
## 5.368e+01
## Adult.Mortality:Schooling
## 9.330e+01
## infant.deaths:Alcohol
## 3.618e+01
## infant.deaths:percentage.expenditure
## 8.951e+02
## infant.deaths:Hepatitis.B
## 5.134e+03
## infant.deaths:BMI
## 1.648e+03
## infant.deaths:under.five.deaths
## 4.808e+02
## infant.deaths:Polio
## 2.376e+02
## infant.deaths:Diphtheria
## 2.092e+04
## infant.deaths:HIV.AIDS
## 1.872e+01
## infant.deaths:GDP
## 1.577e+03
## infant.deaths:Population
## 7.406e+04
## infant.deaths:thinness..1.19.years
## 1.385e+04
## infant.deaths:Income.composition.of.resources
## 4.829e+03
## Alcohol:Hepatitis.B
## 4.749e+01
## Alcohol:Total.expenditure
## 3.343e+01
## Alcohol:Diphtheria
## 1.126e+02
## Alcohol:HIV.AIDS
## 1.597e+01
## Alcohol:Population
## 2.349e+01
## Alcohol:thinness..1.19.years
## 1.078e+01
## percentage.expenditure:Hepatitis.B
## 2.800e+01
## percentage.expenditure:under.five.deaths
## 1.175e+03
## percentage.expenditure:Diphtheria
## 4.224e+02
## percentage.expenditure:GDP
## 2.275e+01
## percentage.expenditure:Income.composition.of.resources
## 2.767e+03
## percentage.expenditure:Schooling
## 2.626e+03
## Hepatitis.B:Measles
## 6.286e+01
## Hepatitis.B:under.five.deaths
## 5.359e+03
## Hepatitis.B:Total.expenditure
## 3.663e+01
## Hepatitis.B:Diphtheria
## 2.463e+01
## Hepatitis.B:Population
## 8.144e+01
## Hepatitis.B:thinness..1.19.years
## 5.593e+02
## Hepatitis.B:thinness.5.9.years
## 5.507e+02
## Measles:BMI
## 1.992e+01
## Measles:Polio
## 2.424e+02
## Measles:Total.expenditure
## 3.579e+01
## Measles:Diphtheria
## 3.718e+02
## Measles:HIV.AIDS
## 9.406e+00
## Measles:GDP
## 5.379e+00
## Measles:Population
## 1.781e+02
## Measles:thinness.5.9.years
## 2.053e+02
## Measles:Income.composition.of.resources
## 3.891e+02
## BMI:under.five.deaths
## 1.656e+03
## BMI:GDP
## 1.799e+01
## BMI:Population
## 4.187e+01
## BMI:thinness..1.19.years
## 1.681e+02
## BMI:thinness.5.9.years
## 1.615e+02
## BMI:Income.composition.of.resources
## 1.320e+02
## BMI:Schooling
## 1.883e+02
## under.five.deaths:Diphtheria
## 1.845e+04
## under.five.deaths:GDP
## 1.908e+03
## under.five.deaths:Population
## 6.835e+04
## under.five.deaths:thinness..1.19.years
## 1.394e+04
## under.five.deaths:thinness.5.9.years
## 9.257e+02
## under.five.deaths:Schooling
## 1.231e+03
## Polio:Total.expenditure
## 5.683e+01
## Polio:GDP
## 1.340e+02
## Polio:thinness..1.19.years
## 2.202e+02
## Polio:thinness.5.9.years
## 2.050e+02
## Polio:Income.composition.of.resources
## 1.081e+02
## Total.expenditure:Diphtheria
## 7.080e+01
## Total.expenditure:HIV.AIDS
## 3.156e+01
## Total.expenditure:GDP
## 1.741e+01
## Total.expenditure:Population
## 8.373e+01
## Total.expenditure:thinness..1.19.years
## 1.884e+01
## Total.expenditure:Income.composition.of.resources
## 6.184e+01
## Diphtheria:Income.composition.of.resources
## 1.361e+02
## Diphtheria:Schooling
## 1.910e+02
## HIV.AIDS:Population
## 5.303e+00
## HIV.AIDS:thinness..1.19.years
## 1.491e+02
## HIV.AIDS:thinness.5.9.years
## 1.469e+02
## GDP:thinness.5.9.years
## 6.555e+00
## GDP:Schooling
## 2.859e+03
## Population:thinness.5.9.years
## 4.160e+02
## Population:Schooling
## 7.387e+02
## thinness..1.19.years:thinness.5.9.years
## 6.667e+01
## thinness..1.19.years:Income.composition.of.resources
## 4.919e+03
## thinness..1.19.years:Schooling
## 3.574e+03
## thinness.5.9.years:Income.composition.of.resources
## 4.938e+03
## thinness.5.9.years:Schooling
## 3.794e+03
This interaction model is getting overly complicated. Let’s inspect the pairwise scatter plots.
pairs(data.frame(data$Life.expectancy, data$Year, data$Status, data$Adult.Mortality, data$infant.deaths, data$Alcohol, data$percentage.expenditure, data$BMI, data$under.five.deaths, data$Polio, data$Total.expenditure, data$HIV.AIDS, data$thinness.5.9.years, data$Income.composition.of.resources, data$Schooling))
Based on these cross-predictor plots, we confirm that (a) HIV.AIDS has a non-linear relationship with Life.expectancy and (b) HIV.AIDS and Adult.Mortality also interact non-linearly. We may also consider removing Year, because its chaotic relationship with the response may pull the model in a different direction.
Check the impact of the high-VIF predictors.
model_selected_wo_underfive = lm(
Life.expectancy ~ Year + Status + Adult.Mortality + infant.deaths + Alcohol + percentage.expenditure + BMI + Polio + Total.expenditure + HIV.AIDS + thinness.5.9.years + Income.composition.of.resources + Schooling,
data = data.train
)
model_selected_wo_infant.deaths = lm(
Life.expectancy ~ Year + Status + Adult.Mortality + under.five.deaths + Alcohol + percentage.expenditure + BMI + Polio + Total.expenditure + HIV.AIDS + thinness.5.9.years + Income.composition.of.resources + Schooling,
data = data.train
)
model_selected_underfive = lm(
under.five.deaths ~ Year + Status + Adult.Mortality + infant.deaths + Alcohol + percentage.expenditure + BMI + Polio + Total.expenditure + HIV.AIDS + thinness.5.9.years + Income.composition.of.resources + Schooling,
data = data.train
)
model_selected_infant.deaths = lm(
infant.deaths ~ Year + Status + Adult.Mortality + under.five.deaths + Alcohol + percentage.expenditure + BMI + Polio + Total.expenditure + HIV.AIDS + thinness.5.9.years + Income.composition.of.resources + Schooling,
data = data.train
)
cor(resid(model_selected_underfive),
resid(model_selected_wo_underfive))
## [1] -0.2527
cor(resid(model_selected_infant.deaths),
resid(model_selected_wo_infant.deaths))
## [1] 0.2487
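The two partial correlations have almost the same magnitude (about 0.25), so neither predictor adds much that the other does not already capture. Based on this, we keep infant.deaths and drop under.five.deaths.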
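The residual-versus-residual correlation used above is a partial correlation: it measures how much of the response a predictor explains once all other predictors are accounted for. A minimal sketch on simulated data follows; the variables `x1`, `x2`, `z` and `y` are hypothetical and are not part of the life expectancy dataset. It also cross-checks the result against the t-statistic identity r = t / sqrt(t^2 + df).

```r
# Simulated data: x2 is nearly a copy of x1, so it adds little once x1 is in the model.
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)   # highly collinear with x1
z  <- rnorm(n)                  # an unrelated extra predictor
y  <- 3 + 2 * x1 + z + rnorm(n)

# Response regressed on everything except x2
fit_wo_x2 <- lm(y ~ x1 + z)
# x2 regressed on the same remaining predictors
fit_x2 <- lm(x2 ~ x1 + z)

# Partial correlation of y and x2, given x1 and z
partial_cor <- cor(resid(fit_x2), resid(fit_wo_x2))

# Cross-check: the same quantity recovered from the t-statistic of x2 in the full model
fit_full <- lm(y ~ x1 + x2 + z)
t_x2 <- summary(fit_full)$coefficients["x2", "t value"]
partial_from_t <- t_x2 / sqrt(t_x2^2 + df.residual(fit_full))

print(c(partial_cor = partial_cor, partial_from_t = partial_from_t))
```

A value near zero suggests the predictor is largely redundant given the others, which is the reasoning applied to under.five.deaths and infant.deaths.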
model_selected = lm(
Life.expectancy ~ Year + Status + Adult.Mortality + infant.deaths + Alcohol + percentage.expenditure + BMI + Polio + Total.expenditure + HIV.AIDS + thinness.5.9.years + Income.composition.of.resources + Schooling ,
data = data.train
)
par(mfrow = c(2, 2))
plot(model_selected)
show_metrics(model_selected)
| bptest | shapiro_test | adj_r2 | LOOCV | TRAIN_RMSE | TEST_RMSE |
|---|---|---|---|---|---|
| 5.59572259383389e-29 | 0.0000817238216404277 | 0.833747060667492 | 3.62603598753409 | 12.7434773099163 | 14.5659873637219 |
vif(model_selected)
## Year StatusDeveloping
## 1.121 1.802
## Adult.Mortality infant.deaths
## 1.837 1.392
## Alcohol percentage.expenditure
## 2.246 1.418
## BMI Polio
## 1.784 1.199
## Total.expenditure HIV.AIDS
## 1.124 1.576
## thinness.5.9.years Income.composition.of.resources
## 1.996 3.263
## Schooling
## 3.679
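For reference, the VIF of a numeric predictor can be reproduced by hand as 1 / (1 - R^2), where R^2 comes from regressing that predictor on the remaining ones. The sketch below uses the built-in `mtcars` data and an arbitrary choice of three predictors purely for illustration; it is not the computation performed by the `vif()` call above, which may also handle factor terms differently.

```r
# Manual VIF on a small built-in dataset (illustrative predictors only)
predictors <- c("wt", "hp", "disp")

manual_vif <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  # Auxiliary regression: predictor p explained by the other predictors
  aux <- lm(reformulate(others, response = p), data = mtcars)
  1 / (1 - summary(aux)$r.squared)
})

print(round(manual_vif, 3))
```

A common rule of thumb treats values above 5 as a sign of problematic collinearity.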
None of the remaining predictors has a VIF above 5, so variance inflation is no longer a concern. Let’s consider two-way interactions among these predictors.
model_selected_interactions = lm(
Life.expectancy ~ (
Year + Status + Adult.Mortality + infant.deaths + Alcohol + percentage.expenditure + BMI + Polio + Total.expenditure + HIV.AIDS + thinness.5.9.years + Income.composition.of.resources + Schooling
) ^ 2 ,
data = data.train
)
par(mfrow = c(2, 2))
plot(model_selected_interactions)
show_metrics(model_selected_interactions)
## Warning in predict.lm(model, data.train): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(model, data.test): prediction from a rank-deficient fit
## may be misleading
| bptest | shapiro_test | adj_r2 | LOOCV | TRAIN_RMSE | TEST_RMSE |
|---|---|---|---|---|---|
| 4.84360148121369e-06 | 3.58602549267733e-09 | 0.89981396499582 | 2.96815448606524 | 7.16067923906753 | 10.356646923468 |
model_selected_interactions_aic = step(model_selected_interactions, trace = FALSE)
#summary(model_selected_interactions_aic)
show_metrics(model_selected_interactions_aic)
| bptest | shapiro_test | adj_r2 | LOOCV | TRAIN_RMSE | TEST_RMSE |
|---|---|---|---|---|---|
| 1.44805155269171e-06 | 1.04271639578767e-09 | 0.901025514701528 | 2.87016492853926 | 7.27373007030712 | 10.5189876384089 |
model_selected_interactions_bic=step(model_selected_interactions, trace=0, k = log(nrow(data.train)))
#summary(model_selected_interactions_bic)
show_metrics(model_selected_interactions_bic)
| bptest | shapiro_test | adj_r2 | LOOCV | TRAIN_RMSE | TEST_RMSE |
|---|---|---|---|---|---|
| 1.09684726187374e-08 | 2.12464993566629e-10 | 0.896246382310163 | 2.9003927388848 | 7.77145266043916 | 10.7769449544755 |
par(mfrow=c(2,2))
plot(model_selected_interactions_bic)
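The only difference between the two `step()` calls above is the penalty `k`: the default `k = 2` corresponds to AIC, while `k = log(n)` corresponds to BIC and penalizes each extra coefficient more heavily. A small sketch on the built-in `mtcars` data (an illustrative stand-in, not the project data) shows the pattern:

```r
# Full model with all two-way interactions among four illustrative predictors
full <- lm(mpg ~ (wt + hp + qsec + drat)^2, data = mtcars)
n <- nrow(mtcars)

# Backward selection with the AIC penalty (default k = 2)
fit_aic <- step(full, trace = 0)
# Backward selection with the BIC penalty (k = log(n))
fit_bic <- step(full, trace = 0, k = log(n))

# Compare model sizes
print(c(
  full = length(coef(full)),
  aic  = length(coef(fit_aic)),
  bic  = length(coef(fit_bic))
))
```

Note that `n` should be the number of rows the model was actually fitted on, which is why `nrow(data.train)` is the natural choice in this report.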
model_selected_poly_interactions=lm(Life.expectancy ~ Year + Status + Adult.Mortality + infant.deaths + Alcohol + percentage.expenditure + BMI + Polio + Total.expenditure + HIV.AIDS + thinness.5.9.years + Income.composition.of.resources + Schooling + infant.deaths:Polio + Adult.Mortality:HIV.AIDS + I(HIV.AIDS^2) , data=data.train)
par(mfrow = c(2, 2))
plot(model_selected_poly_interactions)
show_metrics(model_selected_poly_interactions)
| bptest | shapiro_test | adj_r2 | LOOCV | TRAIN_RMSE | TEST_RMSE |
|---|---|---|---|---|---|
| 9.90771175086473e-12 | 7.59134585416287e-06 | 0.85857538224716 | 3.34510211995818 | 10.811830900963 | 12.8744470461626 |
model_selected_poly_interactions=lm(Life.expectancy ~ Year + Status + Adult.Mortality + infant.deaths + Alcohol + percentage.expenditure + BMI + Polio + Total.expenditure + HIV.AIDS + thinness.5.9.years + Income.composition.of.resources + Schooling + Adult.Mortality:HIV.AIDS +I(HIV.AIDS^2) , data=data.train)
par(mfrow = c(2, 2))
plot(model_selected_poly_interactions)
show_metrics(model_selected_poly_interactions)
| bptest | shapiro_test | adj_r2 | LOOCV | TRAIN_RMSE | TEST_RMSE |
|---|---|---|---|---|---|
| 3.22341712873213e-11 | 7.76432430757746e-06 | 0.857347354152492 | 3.35556741522419 | 10.915304600423 | 12.7046394784675 |
vif_poly = vif(model_selected_poly_interactions)
vif_poly[vif_poly > 5]
## HIV.AIDS I(HIV.AIDS^2) Adult.Mortality:HIV.AIDS
## 13.817 9.387 6.205
length(coef(model_selected_poly_interactions))
## [1] 16
model_selected_poly_interactions=lm(Life.expectancy ~ Year + Status + Adult.Mortality + infant.deaths + Alcohol + percentage.expenditure + BMI + Polio + Total.expenditure + HIV.AIDS + thinness.5.9.years + Income.composition.of.resources + Schooling + Adult.Mortality:HIV.AIDS +I(HIV.AIDS^2) , data=data.train)
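# Note: despite the "_aic" suffix, k = log(n) applies a BIC-style penalty,
# and n is taken from the full data rather than from data.train.
model_selected_poly_interactions_aic=step(model_selected_poly_interactions, trace=0, k=log(nrow(data)))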
par(mfrow = c(2, 2))
plot(model_selected_poly_interactions_aic)
show_metrics(model_selected_poly_interactions_aic)
| bptest | shapiro_test | adj_r2 | LOOCV | TRAIN_RMSE | TEST_RMSE |
|---|---|---|---|---|---|
| 1.30367180064734e-08 | 2.64023093149332e-06 | 0.855313070446564 | 3.37128342500836 | 11.1196033881997 | 13.0238764269319 |
vif_poly = vif(model_selected_poly_interactions_aic)
vif_poly[vif_poly > 5]
## HIV.AIDS I(HIV.AIDS^2) Adult.Mortality:HIV.AIDS
## 13.115 9.183 6.023
length(coef(model_selected_poly_interactions_aic))
## [1] 11
model_selected_outliers = lm(
Life.expectancy ~ Adult.Mortality +
infant.deaths + Polio + HIV.AIDS + thinness.5.9.years + Income.composition.of.resources +
Schooling + Adult.Mortality:HIV.AIDS + I(HIV.AIDS ^ 2),
data = data.train,
subset = abs(rstandard(model_selected_poly_interactions_aic)) < 2
)
par(mfrow = c(2, 2))
plot(model_selected_outliers)
show_metrics(model_selected_outliers)
| bptest | shapiro_test | adj_r2 | LOOCV | TRAIN_RMSE | TEST_RMSE |
|---|---|---|---|---|---|
| 1.98371246826394e-22 | 0.0785951289988709 | 0.902757127129399 | 2.69776680411486 | 11.6767587462108 | 14.2696356612668 |
Based on the plots above, we have arrived at a good model with 10 coefficients (nine terms plus the intercept). The formula is:
formula = Life.expectancy ~ Adult.Mortality + infant.deaths + Polio + HIV.AIDS + thinness.5.9.years + Income.composition.of.resources + Schooling + Adult.Mortality:HIV.AIDS + I(HIV.AIDS^2)
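Judging by the diagnostic plots, the model appears to satisfy the linear-model assumptions reasonably well. Note, however, that although the Shapiro-Wilk value reported below (0.079) no longer rejects normality at the 5% level, the Breusch-Pagan value is still very small, so some heteroscedasticity remains.
The evaluation metrics are as follows: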
show_metrics(model_selected_outliers)
| bptest | shapiro_test | adj_r2 | LOOCV | TRAIN_RMSE | TEST_RMSE |
|---|---|---|---|---|---|
| 1.98371246826394e-22 | 0.0785951289988709 | 0.902757127129399 | 2.69776680411486 | 11.6767587462108 | 14.2696356612668 |
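The `show_metrics` helper is defined earlier in the report and is not reproduced here. Assuming its LOOCV column is a leave-one-out RMSE, the sketch below shows how such a value can be obtained for a linear model without refitting, using the hat values. The helper name `loocv_rmse` and the `mtcars` example are assumptions for illustration only.

```r
# Leave-one-out RMSE for an lm fit via the hat-value shortcut (hypothetical helper)
loocv_rmse <- function(model) {
  sqrt(mean((resid(model) / (1 - hatvalues(model)))^2))
}

fit <- lm(mpg ~ wt + hp, data = mtcars)
shortcut <- loocv_rmse(fit)

# Brute-force check: refit the model n times, leaving one row out each time
errs <- sapply(seq_len(nrow(mtcars)), function(i) {
  fit_i <- lm(mpg ~ wt + hp, data = mtcars[-i, ])
  mtcars$mpg[i] - predict(fit_i, newdata = mtcars[i, ])
})
brute <- sqrt(mean(errs^2))

print(c(shortcut = shortcut, brute = brute))
```

The two values agree because, for ordinary least squares, each leave-one-out residual equals the ordinary residual divided by one minus its leverage.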
Box-Cox confirms that it is not necessary to transform the response variable.
# Box-Cox Transformation
par(mfrow = c(1, 2))
boxcox(model_selected_outliers, plotit = TRUE)
boxcox(model_selected_outliers,
plotit = TRUE,
lambda = seq(0.5, 1.5, by = 0.1))
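Reading the Box-Cox plot by eye can be supplemented with a numeric check: if the approximate 95% interval for lambda contains 1, no transformation of the response is indicated. The following sketch assumes the `MASS` package is available and uses `mtcars` as a stand-in; the object names are illustrative.

```r
library(MASS)

fit <- lm(mpg ~ wt + hp, data = mtcars)

# Profile log-likelihood over a grid of lambda values, without plotting
bc <- boxcox(fit, plotit = FALSE, lambda = seq(-2, 2, by = 0.05))

# Lambda with the highest profile log-likelihood
lambda_hat <- bc$x[which.max(bc$y)]

# Approximate 95% interval: lambdas within qchisq(0.95, 1) / 2 of the maximum
ci <- range(bc$x[bc$y > max(bc$y) - qchisq(0.95, 1) / 2])

# TRUE means the data are compatible with leaving the response untransformed
contains_one <- ci[1] <= 1 && 1 <= ci[2]

print(list(lambda_hat = lambda_hat, ci = ci, contains_one = contains_one))
```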
Explore the relationship between Life.expectancy and each predictor.
plot(Life.expectancy ~ Adult.Mortality, data = data.train, col = "grey", pch = 20, cex = 1.5,
main = "Life.expectancy vs Adult.Mortality")
plot(Life.expectancy ~ infant.deaths, data = data.train, col = "grey", pch = 20, cex = 1.5,
main = "Life.expectancy vs infant.deaths")
plot(Life.expectancy ~ Polio, data = data.train, col = "grey", pch = 20, cex = 1.5,
main = "Life.expectancy vs Polio")
plot(Life.expectancy ~ HIV.AIDS, data = data.train, col = "grey", pch = 20, cex = 1.5,
main = "Life.expectancy vs HIV.AIDS")
plot(Life.expectancy ~ thinness.5.9.years, data = data.train, col = "grey", pch = 20, cex = 1.5,
main = "Life.expectancy vs thinness.5.9.years")
plot(Life.expectancy ~ Income.composition.of.resources, data = data.train, col = "grey", pch = 20, cex = 1.5,
main = "Life.expectancy vs Income.composition.of.resources")
plot(Life.expectancy ~ Schooling, data = data.train, col = "grey", pch = 20, cex = 1.5,
main = "Life.expectancy vs Schooling")
lm_le_am = lm(Life.expectancy ~ Adult.Mortality, data = data.train)
show_metrics(lm_le_am)
| bptest | shapiro_test | adj_r2 | LOOCV | TRAIN_RMSE | TEST_RMSE |
|---|---|---|---|---|---|
| 1.93888870084312e-18 | 5.44652981244873e-27 | 0.484496184454426 | 6.32984080904256 | 39.9298914485206 | 37.4349744962051 |
lm_le_am_log = lm(Life.expectancy ~ log(Adult.Mortality), data = data.train)
show_metrics(lm_le_am_log)
| bptest | shapiro_test | adj_r2 | LOOCV | TRAIN_RMSE | TEST_RMSE |
|---|---|---|---|---|---|
| 0.0000109121008769902 | 4.88198600292786e-17 | 0.183382442308834 | 7.97190730155398 | 63.2535578792075 | 61.5102146650863 |
# The bptest value improves markedly with the log transformation of Adult.Mortality, but adjusted R^2 drops sharply (0.48 to 0.18).
lm_le_am_poly = lm(Life.expectancy ~ Adult.Mortality + I(Adult.Mortality ^ 2), data = data.train)
show_metrics(lm_le_am_poly)
| bptest | shapiro_test | adj_r2 | LOOCV | TRAIN_RMSE | TEST_RMSE |
|---|---|---|---|---|---|
| 3.17932361175982e-25 | 2.76151917851856e-24 | 0.499632136383793 | 6.24283249651564 | 38.7238473195189 | 36.0427661597144 |
# Polynomial transformation doesn't work.
lm_le_id = lm(Life.expectancy ~ infant.deaths, data = data.train)
show_metrics(lm_le_id)
| bptest | shapiro_test | adj_r2 | LOOCV | TRAIN_RMSE | TEST_RMSE |
|---|---|---|---|---|---|
| 0.02657747833814 | 8.27589915106478e-17 | 0.0284300058581662 | 8.69566904890079 | 75.2558627711913 | 74.8725034661356 |
# lm_le_id_log = lm(Life.expectancy ~ log(infant.deaths), data = data.train)
# couldn't perform log transformation because of 0 values.
lm_le_id_poly = lm(Life.expectancy ~ infant.deaths + I(infant.deaths ^ 2), data = data.train)
show_metrics(lm_le_id_poly)
| bptest | shapiro_test | adj_r2 | LOOCV | TRAIN_RMSE | TEST_RMSE |
|---|---|---|---|---|---|
| 0.000298263106736209 | 2.51398765428739e-15 | 0.0893880358135724 | 8.42415869521536 | 70.4729484696273 | 71.4095417460136 |
# Polynomial transformation of infant.deaths improves the fit only slightly (adjusted R^2 0.028 to 0.089).
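When a predictor such as infant.deaths contains zeros, one common workaround is `log1p(x)`, which computes log(1 + x) and is finite at zero. This was not tried in the analysis above; the sketch below uses simulated counts (hypothetical variables `counts` and `y`) only to show the mechanics.

```r
set.seed(1)
counts <- rpois(100, lambda = 2)                  # simulated counts, including zeros
y <- 70 - 3 * log1p(counts) + rnorm(100)          # response built for illustration

# log(0) is -Inf, so lm() on log(counts) would fail when zeros are present
has_zero <- any(counts == 0)

# log1p() keeps zero counts finite: log1p(0) == 0
fit_log1p <- lm(y ~ log1p(counts))

print(has_zero)
print(coef(fit_log1p))
```

Whether `log1p` is appropriate depends on the scale of the variable; for rates much smaller than 1, it behaves almost like the identity.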
lm_le_po = lm(Life.expectancy ~ Polio, data = data.train)
show_metrics(lm_le_po)
| bptest | shapiro_test | adj_r2 | LOOCV | TRAIN_RMSE | TEST_RMSE |
|---|---|---|---|---|---|
| 2.66387530899153e-06 | 1.15592951036877e-13 | 0.119174477527461 | 8.27688462546276 | 68.2269780296235 | 71.0842938162934 |
lm_le_po_log = lm(Life.expectancy ~ log(Polio), data = data.train)
show_metrics(lm_le_po_log)
| bptest | shapiro_test | adj_r2 | LOOCV | TRAIN_RMSE | TEST_RMSE |
|---|---|---|---|---|---|
| 0.0318322360510644 | 9.61459294344335e-16 | 0.0571717899249146 | 8.56104117786634 | 73.0295818335663 | 74.7811670558807 |
# The bptest value improves with the log transformation of Polio, but adjusted R^2 drops (0.119 to 0.057).
lm_le_po_poly = lm(Life.expectancy ~ Polio + I(Polio ^ 2), data = data.train)
show_metrics(lm_le_po_poly)
| bptest | shapiro_test | adj_r2 | LOOCV | TRAIN_RMSE | TEST_RMSE |
|---|---|---|---|---|---|
| 0.0200803369616035 | 6.03245675563947e-12 | 0.235744603268793 | 7.71173926979946 | 59.1463030464273 | 60.9914024560467 |
# Polynomial transformation also helps.
lm_le_hiv = lm(Life.expectancy ~ HIV.AIDS, data = data.train)
show_metrics(lm_le_hiv)
| bptest | shapiro_test | adj_r2 | LOOCV | TRAIN_RMSE | TEST_RMSE |
|---|---|---|---|---|---|
| 1.33841446544784e-10 | 7.77718772625768e-07 | 0.362320502385172 | 7.04958025653526 | 49.3933746964793 | 52.1499460150461 |
lm_le_hiv_log = lm(Life.expectancy ~ log(HIV.AIDS), data = data.train)
show_metrics(lm_le_hiv_log)
| bptest | shapiro_test | adj_r2 | LOOCV | TRAIN_RMSE | TEST_RMSE |
|---|---|---|---|---|---|
| 0.00398413313373982 | 5.03204393449706e-06 | 0.644648154468367 | 5.25460096055283 | 27.5248411170202 | 29.573112720294 |
# bp.test result is significantly improved with log transformation of HIV.AIDS.
lm_le_hiv_poly = lm(Life.expectancy ~ HIV.AIDS + I(HIV.AIDS ^ 2), data = data.train)
show_metrics(lm_le_hiv_poly)
| bptest | shapiro_test | adj_r2 | LOOCV | TRAIN_RMSE | TEST_RMSE |
|---|---|---|---|---|---|
| 0.0468046150825987 | 1.55206717595535e-06 | 0.487029498855961 | 6.33118333697985 | 39.6991749673099 | 43.4723916195385 |
# Polynomial transformation also helps.
lm_le_hiv_log_poly = lm(Life.expectancy ~ log(HIV.AIDS) + I(HIV.AIDS ^ 2), data = data.train) # doesn't further improve
show_metrics(lm_le_hiv_log_poly)
| bptest | shapiro_test | adj_r2 | LOOCV | TRAIN_RMSE | TEST_RMSE |
|---|---|---|---|---|---|
| 0.00686870383790006 | 5.87198474501198e-06 | 0.644741393902713 | 5.25324248579441 | 27.4937321554455 | 29.5878606382631 |
lm_le_thn = lm(Life.expectancy ~ thinness.5.9.years, data = data.train)
show_metrics(lm_le_thn)
| bptest | shapiro_test | adj_r2 | LOOCV | TRAIN_RMSE | TEST_RMSE |
|---|---|---|---|---|---|
| 1.6062500515926e-20 | 1.91144004921282e-15 | 0.21336005686777 | 7.82538800946802 | 60.9315519907407 | 61.6658609753975 |
lm_le_thn_log = lm(Life.expectancy ~ log(thinness.5.9.years), data = data.train)
show_metrics(lm_le_thn_log)
| bptest | shapiro_test | adj_r2 | LOOCV | TRAIN_RMSE | TEST_RMSE |
|---|---|---|---|---|---|
| 0.289116205119675 | 4.52200164957121e-21 | 0.316302351683306 | 7.29420952049337 | 52.9578483371689 | 57.4977265421249 |
# bp.test result is significantly improved with log transformation.
lm_le_thn_poly = lm(Life.expectancy ~ thinness.5.9.years + I(thinness.5.9.years ^ 2),
data = data.train)
show_metrics(lm_le_thn_poly)
| bptest | shapiro_test | adj_r2 | LOOCV | TRAIN_RMSE | TEST_RMSE |
|---|---|---|---|---|---|
| 9.56992897947193e-12 | 1.33708429290685e-12 | 0.362390018182186 | 7.04917200577612 | 49.3451186230709 | 51.7643974324823 |
# Polynomial transformation doesn't help a lot.
lm_le_icr = lm(Life.expectancy ~ Income.composition.of.resources, data = data.train)
show_metrics(lm_le_icr)
| bptest | shapiro_test | adj_r2 | LOOCV | TRAIN_RMSE | TEST_RMSE |
|---|---|---|---|---|---|
| 5.39868610150089e-72 | 9.78699333659299e-26 | 0.559087343704458 | 5.87166945958373 | 34.1522098833106 | 45.020683088903 |
# lm_le_icr_log = lm(Life.expectancy ~ log(Income.composition.of.resources), data = data.train)
# couldn't perform log transformation due to 0 values.
lm_le_icr_poly = lm(
Life.expectancy ~ Income.composition.of.resources + I(Income.composition.of.resources ^ 2),
data = data.train
)
show_metrics(lm_le_icr_poly)
| bptest | shapiro_test | adj_r2 | LOOCV | TRAIN_RMSE | TEST_RMSE |
|---|---|---|---|---|---|
| 1.37401259204068e-21 | 1.9291859350126e-13 | 0.69484854321261 | 4.87547569854632 | 23.6159019817213 | 24.7461223228291 |
# Polynomial transformation helps here: adjusted R^2 rises from 0.56 to 0.69.
lm_le_sch = lm(Life.expectancy ~ Schooling, data = data.train)
show_metrics(lm_le_sch)
| bptest | shapiro_test | adj_r2 | LOOCV | TRAIN_RMSE | TEST_RMSE |
|---|---|---|---|---|---|
| 2.0127149476637e-09 | 2.6063913651081e-19 | 0.532073853020822 | 6.03052121613411 | 36.2446206824462 | 36.7431166111802 |
lm_le_sch_log = lm(Life.expectancy ~ log(Schooling), data = data.train)
show_metrics(lm_le_sch_log)
| bptest | shapiro_test | adj_r2 | LOOCV | TRAIN_RMSE | TEST_RMSE |
|---|---|---|---|---|---|
| 5.3272402214625e-15 | 9.94327190004781e-20 | 0.498226490690044 | 6.24724276659015 | 38.8663694705834 | 39.1081458945406 |
# log transformation doesn't work.
lm_le_sch_poly = lm(Life.expectancy ~ Schooling + I(Schooling ^ 2), data = data.train)
show_metrics(lm_le_sch_poly)
| bptest | shapiro_test | adj_r2 | LOOCV | TRAIN_RMSE | TEST_RMSE |
|---|---|---|---|---|---|
| 4.23704752476234e-08 | 2.84765808491887e-19 | 0.531961805515695 | 6.03435983226363 | 36.2218297792506 | 36.7111420288311 |
# Polynomial transformation doesn't help very much.
# New model based on the transformation result
model_trans = lm(Life.expectancy ~ log(Adult.Mortality) + infant.deaths + Polio + I(Polio ^ 2) + HIV.AIDS + I(HIV.AIDS^2) + log(thinness.5.9.years) + Income.composition.of.resources + Schooling + Adult.Mortality:HIV.AIDS, data = data.train)
show_metrics(model_selected_outliers)
| bptest | shapiro_test | adj_r2 | LOOCV | TRAIN_RMSE | TEST_RMSE |
|---|---|---|---|---|---|
| 1.98371246826394e-22 | 0.0785951289988709 | 0.902757127129399 | 2.69776680411486 | 11.6767587462108 | 14.2696356612668 |
show_metrics(model_trans)
| bptest | shapiro_test | adj_r2 | LOOCV | TRAIN_RMSE | TEST_RMSE |
|---|---|---|---|---|---|
| 8.40002629693026e-11 | 3.43722742650363e-07 | 0.836954948860762 | 3.58210034060524 | 12.530477415429 | 15.8469601746367 |
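# After the transformations, the bptest value is larger than for model_selected_outliers (8.4e-11 vs 1.98e-22),
# but the Shapiro-Wilk value, adjusted R^2 and LOOCV are all worse.
# Note that model_trans is fit on all training rows, while model_selected_outliers excludes
# observations with large standardized residuals, so the two are not directly comparable.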
MODEL COMPARISON + BEST MODEL + TRAIN vs TEST STATS
Ideas for further improvements
Put all not important stuff here